- Path: bloom-beacon.mit.edu!senator-bedfellow.mit.edu!faqserv
- From: andrewh@speech.su.oz.au (Andrew Hunt)
- Newsgroups: comp.speech,comp.answers,news.answers
- Subject: comp.speech Frequently Asked Questions - part 3/3
- Supersedes: <comp-speech-faq/part3_764040899@rtfm.mit.edu>
- Followup-To: comp.speech
- Date: 16 Apr 1994 13:08:05 GMT
- Organization: Speech Technology Group, The University of Sydney
- Lines: 996
- Approved: news-answers-request@MIT.Edu
- Expires: 28 May 1994 13:05:48 GMT
- Message-ID: <comp-speech-faq/part3_766501548@rtfm.mit.edu>
- References: <comp-speech-faq/part1_766501548@rtfm.mit.edu>
- Reply-To: andrewh@speech.su.oz.au (Andrew Hunt)
- NNTP-Posting-Host: bloom-picayune.mit.edu
- Summary: Useful information about Speech Technology
- X-Last-Updated: 1994/04/06
- Originator: faqserv@bloom-picayune.MIT.EDU
- Xref: bloom-beacon.mit.edu comp.speech:2285 comp.answers:4934 news.answers:18148
-
- Archive-name: comp-speech-faq/part3
- Last-modified: 1994/04/06
-
-
- SECTION 5 - Speech Synthesis
-
- Q5.1: What is speech synthesis?
-
- Speech synthesis is the task of transforming written input to spoken output.
- The input can either be provided in a graphemic/orthographic or a phonemic
- script, depending on its source.
-
- ------------------------------------------------------------------------
-
- Q5.2: How can speech synthesis be performed?
-
- There are several algorithms. The choice depends on the task they're used
- for. The easiest way is to just record the voice of a person speaking the
- desired phrases. This is useful if only a restricted volume of phrases and
- sentences is used, e.g. messages in a train station, or schedule information
- via phone. The quality depends on the way recording is done.
-
- More sophisticated, but lower in quality, are algorithms which split the
- speech into smaller pieces. The smaller those units are, the fewer of them
- are needed, but the quality also decreases. An often used unit is the phoneme,
- the smallest linguistic unit. Depending on the language, there are about
- 35-50 phonemes in western European languages, i.e. only 35-50 single
- recordings are needed. The problem is combining them, as fluent speech
- requires fluent transitions between the elements. The intelligibility is
- therefore lower, but the memory required is small.
-
- A solution to this dilemma is using diphones. Instead of splitting at the
- transitions, the cut is done at the center of the phonemes, leaving the
- transitions themselves intact. This gives about 400 elements (20*20) and
- the quality increases.
-
- The longer the units become, the more elements there are, and the quality
- increases along with the memory required. Other units which are widely used
- are half-syllables, syllables, words, or combinations of them, e.g. word stems
- and inflectional endings.
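-
- The diphone idea can be illustrated with a short sketch. Python is used here
- purely for illustration; the function names and the "_" silence symbol are
- assumptions and do not come from any package listed in this FAQ. The sketch
- cuts a phoneme string into diphone names and computes the upper bound on the
- size of a diphone inventory.
-
```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into diphone names.

    Each diphone runs from the middle of one phoneme to the middle of the
    next, so the transition between the two is stored intact. The utterance
    is padded with a silence symbol at both ends.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_diphone_inventory(n_phonemes):
    """Upper bound on the number of diphones: every ordered phoneme pair."""
    return n_phonemes * n_phonemes


# "cat" written as a hypothetical phoneme string /k ae t/.
print(to_diphones(["k", "ae", "t"]))   # ['_-k', 'k-ae', 'ae-t', 't-_']
print(max_diphone_inventory(20))       # 400
```
-
- The 20*20 figure above corresponds to 20 phonemes; for 35-50 phonemes the
- same bound gives 1225-2500 pairs, although not every pair occurs in a given
- language.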
-
- ------------------------------------------------------------------------
-
- Q5.3: What are some good references/books on synthesis?
-
- The following are good introductory books/articles.
-
- Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal Processing,
- 1987.
-
- D. H. Klatt, "Review of Text-To-Speech Conversion for English", Jnl. of
- the Acoustical Society of America (JASA), v82, Sept. 1987, pp 737-793.
-
- I. H. Witten. Principles of Computer Speech.
- (London: Academic Press, Inc., 1982).
-
- John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to Speech:
- The MITalk System", Cambridge University Press, 1987.
-
- ------------------------------------------------------------------------
-
- Q5.4: What software/hardware is available?
-
- In the last year there has been a great increase in the release of speech
- synthesis software - both commercial and public domain. The following is
- a list of as many products/packages as I can find out about. Any help in
- keeping this list up-to-date will be appreciated.
-
-
-
- Package: ORATOR Text-to-Speech Synthesizer
- Platform: SUN SPARC, Decstation 5000. Portable to other UNIX platforms.
- Description: Sophisticated speech synthesis package. Has text preprocessing
- (for abbreviations, numbers), acronym citation rules, and human-like
- spelling routines. High accuracy for pronunciation of names of
- people, places and businesses in America, text-to-speech translation
- for common words; rules for stress and intonation marking, based on
- natural-sounding demisyllable synthesis; various methods of user
- control and customization at most stages of processing. Currently,
- ORATOR is most appropriate for applications containing a large
- component of names in the text, and requires some amount of user-
- specified text-preprocessing to produce good quality speech for
- general text.
- Hardware: Standard audio output of SPARC, or Decstation audio hardware.
- At least 16M of memory recommended.
- Cost: Binary License: $5,000.
- Source license for porting or commercial use: $30,000.
- Availability: Contact Bellcore's Licensing Office (1-800-527-1080)
- or email: jzilg@cc.bellcore.com (John Zilg)
-
-
- Package: Text to phoneme program (1)
- Platform: unknown
- Description: Text to phoneme program. Based on Naval Research Lab's
- set of text to phoneme rules.
- Availability: By FTP from "shark.cse.fau.edu" (131.91.80.13) in the directory
- /pub/src/phon.tar.Z
-
-
- Package: Text to phoneme program (2)
- Platform: unknown
- Description: Text to phoneme program.
- Availability: By FTP from "wuarchive.wustl.edu" in the file
- /mirrors/unix-c/utils/phoneme.c
-
-
- Package: Text to phoneme program (3)
- Description: A public domain version of the same Naval Research Lab
- text to phoneme rules.
- Availability: By anonymous ftp from
- svr-ftp.eng.cam.ac.uk:comp.speech/sources/english2phoneme.shar
-
-
- Package: Text to speech program
- Description: An implementation of the Klatt phoneme to waveform speech
- synthesiser.
- Availability: By anonymous ftp from
- svr-ftp.eng.cam.ac.uk:comp.speech/sources/klatt-0.02.tar.Z
-
-
- Package: "Speak" - a Text to Speech Program
- Platform: Sun SPARC
- Description: Text to speech program based on concatenation of pre-recorded
- speech segments. A function library can be used to integrate
- speech output into other code.
- Hardware: SPARC audio I/O
- Availability: by FTP from "wilma.cs.brown.edu" as /pub/speak.tar.Z
-
-
- Package: TheBigMouth - a Text to Speech Program
- Platform: NeXT
- Description: Text to speech program based on concatenation of pre-recorded
- speech segments. NeXT equivalent of "Speak" for Suns.
- Availability: try NeXT archive sites such as sonata.cc.purdue.edu.
-
-
- Package: TextToSpeech Kit
- Platform: NeXT Computers
- Description: The TextToSpeech Kit does unrestricted conversion of English
- text to synthesized speech in real-time. The user has control over
- speaking rate, median pitch, stereo balance, volume, and intonation
- type. Text of any length can be spoken, and messages can be queued
- up, from multiple applications if desired. Real-time controls such
- as pause, continue, and erase are included. Pronunciations are
- derived primarily by dictionary look-up. The Main Dictionary has
- nearly 100,000 hand-edited pronunciations which can be supplemented
- or overridden with the User and Application dictionaries. A number
- parser handles numbers in any form. A letter-to-sound knowledge base
- provides pronunciations for words not in the Main or customized
- dictionaries. Dictionary search order is under user control.
- Special modes of text input are available for spelling and emphasis
- of words or phrases. The actual conversion of text to speech is done
- by the TextToSpeech Server. The Server runs as an independent task
- in the background, and can handle up to 50 client connections.
- Misc: The TextToSpeech Kit comes in two packages: the Developer Kit and the
- User Kit. The Developer Kit enables developers to build and test
- applications which incorporate text-to-speech. It includes the
- TextToSpeech Server, the TextToSpeech Object, the pronunciation
- editor PrEditor, several example applications, phonetic fonts,
- example source code, and developer documentation. The User Kit
- provides support for applications which incorporate text-to-speech.
- It is a subset of the Developer Kit.
- Hardware: Uses standard NeXT Computer hardware.
- Cost: TextToSpeech User Kit: $175 CDN ($145 US)
- TextToSpeech Developer Kit: $350 CDN ($290 US)
- Upgrade from User to Developer Kit: $175 CDN ($145 US)
- Availability: Trillium Sound Research
- 1500, 112 - 4th Ave. S.W., Calgary, Alberta, Canada, T2P 0H3
- Tel: (403) 284-9278 Fax: (403) 282-6778
- Order Desk: 1-800-L-ORATOR (US and Canada only)
- Email: TTSInfo@trillium.ab.ca
-
-
- Package: SGI Developers Toolbox Synthesiser
- Platform: SGI
- Description: The SGI Developer Toolbox 4.0 CDROM contains a basic
- public domain text-to-speech program in the publics/speak
- directory. The directory includes man pages and source.
- Availability: on the SGI Developer Toolbox 4.0 CDROM
-
-
- Package: rsynth
- Platform: Various (including Sun, Linux, NeXT, SGI)
- Description: Text-to-speech converter produced by combination of
- various public-domain pieces.
- Price: Free
- Availability: by anonymous ftp from
- svr-ftp.eng.cam.ac.uk:/comp.speech/sources/rsynth-1.0.tar.Z
- svr-ftp.eng.cam.ac.uk:/comp.speech/sources/rsynth-1.0.tar.gz
-
-
- Package: SENSYN speech synthesizer
- Platform: PC, Mac, Sun, and NeXT
- Rough Cost: $300
- Description: This formant synthesizer produces speech waveform files
- based on the (Klatt) KLSYN88 synthesizer. It is intended
- for laboratory and research use. Note that this is NOT a
- text-to-speech synthesizer, but creates speech sounds based
- upon a large number of input variables (formant frequencies,
- bandwidths, glottal pulse characteristics, etc.) and would
- be used as part of a TTS system. Includes full source code.
- Availability: Sensimetrics Corporation, 64 Sidney Street, Cambridge MA 02139.
- Fax: (617) 225-0470; Tel: (617) 225-2442.
- Email: sensimetrics@sens.com
-
-
- Package: SPCHSYN.EXE
- Platform: PC?
- Availability: By anonymous ftp from evans.ee.adfa.oz.au (131.236.30.24)
- in /mirrors/tibbs/Applications/SPCHSYN.EXE
- It is a self extracting DOS archive.
- Requirements: May require special TI product(s), but all source is there.
-
-
- Package: CSRE: Canadian Speech Research Environment
- Platform: PC
- Cost: Distributed on a cost recovery basis
- Description: CSRE is a software system which includes in addition to the
- Klatt speech synthesizer, SPEECH ANALYSIS and EXPERIMENT CONTROL
- SYSTEM. A paper about the whole package can be found in:
- Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc.
- of the Second Intl. Conf. on Spoken Language Processing, Edmonton:
- University of Alberta, pp. 1127-1130.
- Hardware: Can use a range of data acquisition/DSP
- Availability: For more information about the availability of this software
- contact Krystyna Marciniak - email march@uwovax.uwo.ca
- Tel (519) 661-3901 Fax (519) 661-3805.
- For technical information email ramji@uwovax.uwo.ca
- Note: A more detailed description is given in Q1.8 on speech environments.
-
-
- Package: Eloquence (currently an alpha release)
- Platform: Windows and Solaris
- Description: Software based text-to-speech package. Generates waveforms
- completely algorithmically instead of by concatenating waveforms,
- for maximum flexibility and naturalism. For instance, when the
- user requests a deeper voice, the software simulates a larger vocal
- tract, instead of simply pitch-shifting samples.
- Uses high-level linguistic parsing, which obviates the need for a
- huge dictionary. Handles numbers, acronyms, currency, etc.
- Includes a set of annotation symbols, for placing stress on particular
- words, expressing excitement/boredom, etc. Also allows phonetic input.
- The final version, including support for Windows DDE and OLE and
- UNIX Sockets, will be released by the end of 1994.
- Produces male and female voices for General American English.
- Dialects under development include Alabama, Brooklyn, and Boston.
- Price: $5000 (unconfirmed)
- Availability: Eloquent Technology, Inc.
- 24 Highgate Circle
- Ithaca, NY 14850
- Ph: (607) 257-6829 Fax: (607) 272-0058
-
-
- Package: JSRU
- Platform: UNIX and PC
- Cost: 100 pounds sterling (from academic institutions and industry)
- Description: A C version of the JSRU system, Version 2.3 is available.
- It's written in Turbo C but runs on most Unix systems with very
- little modification. A Form of Agreement must be signed to say
- that the software is required for research and development only.
- Contact: Dr. E.Lewis (eric.lewis@uk.ac.bristol)
-
-
- Package: Klatt-style synthesiser
- Platform: Unix
- Cost: Free
- Description: Software posted to comp.speech in late 1992.
- Availability: By anonymous ftp from the comp.speech archives as
- svr-ftp.eng.cam.ac.uk:/comp.speech/sources/klatt-0.02.tar.Z
-
-
- Package: Speech Manager and PlainTalk
- Platform: Macintosh
- Cost: Free
- Description: Apple's new text-to-speech system extension(s) that enable
- applications (listed below) to perform text-to-speech
- conversion. The Speech Manager runs on most Macs, but PlainTalk
- (and the high quality voices) requires a 68020 Mac or better.
- Availability: By anonymous ftp from:
- ftp.apple.com:/dts/mac/sys.soft/speech
- There are 3 files in this directory:
- 6273632 Aug 14 22:51 macintalk-pro.hqx
- PlainTalk Text-To-Speech 1.0 speech synthesizer
- extension (includes Female Voice, Compressed);
- TTS Female Voice; TTS Male Voice; and
- TTS Male Voice, Compressed. Requires 68020 or better!
- 370108 Aug 13 04:30 speech-manager-docs.hqx
- Apple DocViewer format (Inside Macintosh style,
- no installation instructions - just drag everything
- onto your closed System Folder).
- 262569 Aug 7 07:01 speech-manager.hqx
- Speech Manager 1.1.1 (includes Marvin's voice) and
- MacInTalk Voices 1.1.1 (9 more voices). Runs on most Macs.
-
-
- Package: Various Mac Speech Output Applications
- Platform: Macintosh
- Cost: Free (except for At Ease)
- Description: Some of the Speech Manager aware text-to-speech (TTS)
- applications, etc. are listed below (there are more on the
- Apple Developer CD-ROMs).
-
- Application, etc.   Source              Comments
- _________________   ________            _________________________________________________
- AddressSpeech       info-mac            4D talking address book (from Speech Pack 2.0)
- At Ease 2.0         MacWarehouse        Friendly desktop that speaks file names
- At Ease 2.0 WG      MacWarehouse        Friendly desktop that speaks file names
- Eliza 3.1           AOL                 Talking Eliza (Rogerian psych therapist)
- FB speech           Inside Basic Mag, volume 3, no. 6.  FutureBasic demo
- FB Speech demo      Inside Basic Mag, volume 3, no. 7.  FutureBasic demo
- Fortune 1.1         info-mac            Like a talking UNIX fortune command - slick
- Homer 0.92d9        zaphod.ee.pitt.edu  GUI IRC client, assign nicks voices - slick
- MacMessage 1.0      FirstClassBBS       Share talking messages/customizable startup
- Say                 info-mac            MPW Tool which converts standard input to speech
- ScriptTools 1.2     info-mac            Write AppleScript scripts to say text messages
- Siege Watch 1.01f   info-mac            Wryly political speaking clock
- SoToSpeak1.0.0b10   info-mac            Two voice conversation (also see Fortune's About)
- Speak It!           info-mac            Type in a message and have it spoken
- Speaker 1.11        info-mac            Simple text file editor, speaks on <CR>, macros
- Speecher 1.2.1      info-mac            Customizable word pronunciation/substitution
- SpeechManagerdemo   info-mac            Command line interface, C source, aka -explorer
- Speech Pack 2.0     info-mac            4th Dimension external, add speech to database
- SpeechUnitEx        info-mac            Pascal source code for speech in Lab 7
- speek-02b           info-mac            Speech XCMD for HyperCard
- TalkingClockPro2.0  info-mac            AppleScriptable talking clock extension (2.0b0)
- TeachText 7.2       AV Mac              Apple's talking TeachText (simple editor w/QT)
- Tex-Edit 1.9        AOL                 Talking word processor, McSink like, modeming
- VoiceDemo 1.0.1     info-mac            Bare bones phrase talker
- Welcome!v1.3.1      info-mac            A talking Welcome to Macintosh startup
- ?                   ?                   Talking Plug-In-Module for MS Word 5,
-                                         experimental, unsupported, buggy, beware!
- Speech Rhythms      AOL                 A cool text file for one of the above apps
- _____
- Sources:
- AOL = America Online
- info-mac = {ftp sumex-aim.stanford.edu, ftp wuarchive.wustl.edu, et al.}
- MacWarehouse = (800) 255-6227
-
- Apple's work in spoken language technologies and systems is described in:
- Lee, Kai-Fu. "The Conversational Computer: An Apple Perspective."
- (Keynote Speech) In Proc. Eurospeech in Berlin, ESCA, September, 1993.
-
-
- Package: MacinTalk
- Platform: Macintosh
- Cost: Free
- Description: Formant based speech synthesis.
- There is also a program called "tex-edit" which apparently
- can pronounce English sentences reasonably using Macintalk.
- Note: MacinTalk doesn't run reliably on Macintoshes with new
- sound hardware under the latest OS (System 7.1 w/HUD 2.0).
- More recent software is listed above.
- Availability: By anonymous ftp from many archive sites (have a look on
- archie if you can). tex-edit is on many of the same sites. Try
- wuarchive.wustl.edu:/mirrors2/info-mac/Old/card/macintalk.hqx[.Z]
- /macintalk-stack.hqx[.Z]
- wuarchive.wustl.edu:/mirrors2/info-mac/app/tex-edit-15.hqx
-
-
- Package: Lernout & Hauspie Text-To-Speech SDK
- Platform: IBM-Compatible
- Description: The L&H Text-to-Speech software developer's kit lets you
- integrate text-to-speech technology into your own or existing
- PC applications under Microsoft Windows 3.1. The software
- converts written text into clear, human-sounding synthetic
- speech.
- Requirements: IBM-compatible PC 386 DX (33 MHz) or higher, 8 MB RAM,
- MS DOS 5.0 (or higher), MS Windows 3.1 (or higher),
- Compiler and linker: Microsoft(R) Visual C++ or Borland C++,
- Windows(TM) 3.1 compatible sound card, preferably 16 bit
- e.g. Soundblaster, Windows Sound System, Pro Audio Spectrum
- Price: Unconfirmed $1,999 per copy, and $499 per each additional language
- (American English, French, German, or Spanish).
- Contact: USA (617) 932-4118
-
-
- Package: Tinytalk
- Platform: PC
- Description: A shareware speech 'screen reader' which is used
- by many blind users.
- Availability: By anonymous ftp from handicap.shel.isc-br.com.
- Get the files /speech/ttexe145.zip & /speech/ttdoc145.zip.
-
-
- Package: Narrator - narrator.device
- Platform: Amiga
- Description: Formant based speech synthesis. Includes an English-to-phoneme
- translation library, and a SPEAK: pseudo-device for speech
- output.
- Hardware: Standard Amiga hardware
- Availability: Part of AmigaOS
-
-
- Product Series: Infovox
- Description: Multilingual Text-to-speech systems, languages available:
- American English, British English, German, French, Spanish,
- Italian, Swedish, Norwegian, Icelandic, Danish and Finnish.
- Product name: INFOVOX 500, PC BOARD
- * Product description: Half length expansion board for IBM PC, XT, AT,
- PS/2 model 30 or compatible personal computers. The board can
- also be connected via the serial port. Language and control program
- for downloading into RAM or mounted on EPROMs.
- * Platform: for IBM PC, XT, AT, PS/2 model 30 or compatible
- Product name: INFOVOX 600, OEM BOARD
- * Product description: OEM board built with CMOS IC's. Language and
- control program are stored in on-board fixed memory.
- * Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600 Baud
- Product name: INFOVOX 700, DESKTOP UNIT
- * Product description: Desktop unit with built in Infovox 600 to be
- connected to any computer or terminal via an RS 232-C serial
- interface. Built in loudspeaker and rechargeable battery for 4 hours
- use, and control knobs for continuous control of speech volume and
- speed.
- * Platform: any
- Product name: INFOVOX 650, OEM BOARD
- * Product description: OEM-board built with CMOS IC's. Language and
- control program are stored in on-board memory.
- * Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600 Baud
- Product name: INFOVOX 750, DESKTOP UNIT
- * Product description: Desktop unit with built in Infovox 650 to be
- connected to any computer or terminal via an RS 232-C serial
- interface. Built in loudspeaker and rechargeable battery for 5 hours
- use, and a control knob for continuous control of speech volume.
- * Platform: any
- Misc: Infovox multi-lingual Text-to-Speech Technologies can interface with
- Apple's PlainTalk System. It enables Apple Third party developers
- to write application software with synthetic speech output using
- their usual Apple Plain Talk Text-to-Speech interface. Software
- already written for the English speaking market using Apple Plain
- Talk can be now distributed worldwide, provided message strings
- are translated.
- Contact: TELIA PROMOTOR INFOVOX AB
- TTS Sales Division
- P.O. Box 2069
- S-171 02 Solna, Sweden
- Ph: +46 8 764 35 00 Fax: +46 8 735 78 76
- email: tts-sales@infovox.se
-
-
- SIMTEL-20
- The following is a list of speech related software available from
- SIMTEL-20 and its mirror sites for PCs.
- The SIMTEL internet address is WSMR-SIMTEL20.Army.Mil [192.88.110.20]
- Try looking at your nearest archive site first.
- Directory PD1:<MSDOS.VOICE>
- Filename       Type  Length  Date    Description
- ==============================================
- AUTOTALK.ARC   B      23618  881216  Digitized speech for the PC
- CVOICE.ARC     B      21335  891113  Tells time via voice response on PC
- HEARTYPE.ARC   B      10112  880422  Hear what you are typing, crude voice synth.
- HELPME2.ARC    B       8031  871130  Voice cries out 'Help Me!' from PC speaker
- SAY.ARC        B      20224  860330  Computer Speech - using phonemes
- SPEECH98.ZIP   B      41003  910628  Build speech (voice) on PC using 98 phonemes
- TALK.ARC       B       8576  861109  BASIC program to demo talking on a PC speaker
- TRAN.ARC       B      39766  890715  Repeats typed text in digital voice
- VDIGIT.ZIP     B     196284  901223  Toolkit: Add digitized voice to your programs
- VGREET.ARC     B      45281  900117  Voice says good morning/afternoon/evening
-
-
-
- Package: Bliss
- Contact: Dr. John Mertus (Brown University) Mertus@browncog.bitnet
-
-
- Package: xxx
- Platform: (PC, Mac, Sun, NeXT etc)
- Rough Cost: (if appropriate)
- Description: (keep it brief)
- Hardware: (requirement list)
- Availability: (ftp info, email contact or company contact)
-
-
-
- Can anyone provide information on the following:
-
- MultiVoice
- Monolog
- TrueSpeech from DSP Group Inc.
- The range of recently released Windows products
-
- Please email or post suitable information for this list. Commercial,
- public domain and research packages are all appropriate.
-
-
-
- =======================================================================
-
- SECTION 6 - Speech Recognition
-
- Q6.1: What is speech recognition?
-
- Automatic speech recognition is the process by which a computer maps an
- acoustic speech signal to text.
-
- Automatic speech understanding is the process by which a computer maps an
- acoustic speech signal to some form of abstract meaning of the speech.
-
- ------------------------------------------------------------------------
-
- Q6.2: How can I build a very simple speech recogniser?
-
- Doug Danforth provides a detailed account in article 253 in the comp.speech
- archives - also available as file info/DIY_Speech_Recognition.
-
- The first part is reproduced here.
-
- QUICKY RECOGNIZER sketch:
-
- Here is a simple recognizer that should give you 85%+ recognition
- accuracy. The accuracy is a function of WHAT words you have in
- your vocabulary. Long distinct words are easy. Short similar
- words are hard. You can get 98+% on the digits with this recognizer.
-
- Overview:
- (1) Find the beginning and end of the utterance.
- (2) Filter the raw signal into frequency bands.
- (3) Cut the utterance into a fixed number of segments.
- (4) Average data for each band in each segment.
- (5) Store this pattern with its name.
- (6) Collect training set of about 3 repetitions of each pattern (word).
- (7) Recognize unknown by comparing its pattern against all patterns
- in the training set and returning the name of the pattern closest
- to the unknown.
-
- Many variations upon the theme can be made to improve the performance.
- Try different filtering of the raw signal and different processing methods.
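-
- The seven steps above can be sketched in a few dozen lines. This is an
- illustrative sketch only, not Doug Danforth's code: the choice of Python,
- the Goertzel filter used for step (2), the band frequencies, frame size,
- log compression and the synthetic tones standing in for recorded words are
- all assumptions made for the example.
-
```python
import math


def goertzel_power(frame, freq, rate):
    """Signal power near `freq` Hz: a cheap single-band filter."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def split_frames(signal, size):
    """Cut the signal into consecutive, non-overlapping frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def trim_silence(frames, ratio=0.1):
    """Step 1: keep the frames between the first and last loud frame."""
    energy = [sum(x * x for x in f) for f in frames]
    limit = ratio * max(energy)
    loud = [i for i, e in enumerate(energy) if e >= limit]
    return frames[loud[0]:loud[-1] + 1]


def make_pattern(signal, rate, bands=(300, 600, 1200, 2400),
                 n_segments=4, frame_size=80):
    """Steps 1-4: turn one utterance into a fixed-length pattern."""
    frames = trim_silence(split_frames(signal, frame_size))
    # Step 2: energy in each frequency band, frame by frame.
    spectrum = [[goertzel_power(f, b, rate) for b in bands] for f in frames]
    pattern = []
    for s in range(n_segments):
        # Step 3: the frames belonging to segment s (at least one frame).
        lo = s * len(spectrum) // n_segments
        hi = max(lo + 1, (s + 1) * len(spectrum) // n_segments)
        chunk = spectrum[lo:hi]
        # Step 4: average each band over the segment (log compresses range).
        for b in range(len(bands)):
            mean = sum(row[b] for row in chunk) / len(chunk)
            pattern.append(math.log(max(mean, 1e-9)))
    return pattern


def recognise(unknown, training_set):
    """Step 7: name of the stored pattern closest to the unknown one."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training_set, key=lambda item: distance(item[1], unknown))[0]


def tone(freq, amp=1.0, seconds=0.2, rate=8000, pad=800):
    """Synthetic stand-in for a recorded word: a tone padded with silence."""
    n = int(rate * seconds)
    body = [amp * math.sin(2.0 * math.pi * freq * i / rate) for i in range(n)]
    return [0.0] * pad + body + [0.0] * pad


# Steps 5-6: store three named repetitions of each "word".
training_set = [(name, make_pattern(tone(freq, amp), 8000))
                for name, freq in (("low", 300), ("high", 2400))
                for amp in (0.8, 1.0, 1.2)]

# A new, longer and quieter "low" utterance is matched to the "low" patterns.
print(recognise(make_pattern(tone(300, 0.9, 0.25), 8000), training_set))
```
-
- The signal must contain at least one full frame. With real speech the band
- frequencies, the number of segments and the distance measure are the obvious
- things to vary.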
-
- ------------------------------------------------------------------------
-
- Q6.2: What does speaker dependent/adaptive/independent mean?
-
- A speaker dependent system is developed (trained) to operate for a single
- speaker. These systems are usually easier to develop, cheaper to buy and
- more accurate, but are not as flexible as speaker adaptive or speaker
- independent systems.
-
- A speaker independent system is developed (trained) to operate for any
- speaker or speakers of a particular type (e.g. male/female, American/English).
- These systems are the most difficult to develop and the most expensive, and
- their accuracy is currently not as good as that of speaker dependent
- systems. They are, however, the most flexible.
-
- A speaker adaptive system is developed to adapt its operation to new
- speakers that it encounters, usually based on a general model of speaker
- characteristics. It lies somewhere between speaker independent and speaker
- dependent systems.
-
- Each type of system is suited to different applications and domains.
-
- ------------------------------------------------------------------------
-
- Q6.3: What does small/medium/large/very-large vocabulary mean?
-
- The size of vocabulary of a speech recognition system affects the complexity,
- processing requirements and the accuracy of the system. Some applications
- only require a few words (e.g. numbers only), others require very large
- dictionaries (e.g. dictation machines).
-
- There are no established definitions but the following may be a helpful guide.
-
- small vocabulary - tens of words
- medium vocabulary - hundreds of words
- large vocabulary - thousands of words
- very-large vocabulary - tens of thousands of words.
-
- ------------------------------------------------------------------------
-
- Q6.4: What does continuous speech or isolated-word mean?
-
- An isolated-word system operates on single words at a time - requiring a
- pause between saying each word. This is the simplest form of recognition
- to perform, because the pronunciation of the words tends not to affect each
- other. Because the occurrences of each particular word are similar they are
- easier to recognise.
-
- A continuous speech system operates on speech in which words are connected
- together, i.e. not separated by pauses. Continuous speech is more difficult
- to handle because of a variety of effects. First, it is difficult to find
- the start and end points of words. Another problem is "coarticulation".
- The production of each phoneme is affected by the production of surrounding
- phonemes, and similarly the start and end of words are affected by the
- preceding and following words. The recognition of continuous speech is also
- affected by the rate of speech (fast speech tends to be harder).
-
- ------------------------------------------------------------------------
-
- Q6.5: How is speech recognition done?
-
- A wide variety of techniques are used to perform speech recognition.
- There are many types of speech recognition. There are many levels of
- speech recognition/processing/understanding.
-
- Typically speech recognition starts with the digital sampling of speech.
- The next stage would be acoustic signal processing. Common techniques
- include a variety of spectral analyses, LPC analysis, the cepstral transform,
- cochlea modelling and many, many more.
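-
- As an illustration of one such front end, the sketch below computes the real
- cepstrum of a single frame. It is a minimal example under stated assumptions:
- Python, a Hamming window, a 64-sample frame and a plain O(n^2) DFT are
- choices made here for brevity, and practical systems use an FFT and usually
- further processing.
-
```python
import cmath
import math


def dft(x):
    """Plain O(n^2) discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame):
    """Hamming window -> |DFT| -> log -> inverse DFT."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # The small constant avoids log(0) for empty spectral bins.
    log_mag = [math.log(abs(c) + 1e-12) for c in dft(windowed)]
    # The log magnitude spectrum of a real frame is symmetric, so its
    # inverse DFT is real; only rounding noise is discarded here.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


# One 8 ms frame of a synthetic 500 Hz tone sampled at 8 kHz.
frame = [math.sin(2.0 * math.pi * 500 * i / 8000) for i in range(64)]
cepstrum = real_cepstrum(frame)
print([round(c, 3) for c in cepstrum[:8]])
```
-
- The low-order cepstral coefficients describe the smooth spectral envelope,
- which is why they are a common choice of feature for the pattern matching
- stage.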
-
- The next stage will typically try to recognise phonemes, groups of phonemes
- or words. This stage can be achieved by many processes such as DTW (Dynamic
- Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), and
- sometimes expert systems. In crude terms, all these processes attempt to
- recognise the patterns of speech. The most advanced systems are
- statistically motivated.
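-
- Of the techniques named above, DTW is the easiest to show in a few lines.
- The following is a minimal sketch, assuming each utterance has already been
- reduced to a sequence of feature vectors; the function names and the toy
- one-dimensional "rise"/"fall" templates are invented for the example, and no
- slope constraints or path-length normalisation are applied.
-
```python
import math


def dtw_distance(a, b):
    """Cost of the best time alignment between feature sequences a and b."""
    def local(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j],      # advance in a only
                       cost[i][j - 1],      # advance in b only
                       cost[i - 1][j - 1])  # advance in both
            cost[i][j] = local(a[i - 1], b[j - 1]) + step
    return cost[n][m]


def recognise(unknown, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates,
               key=lambda name: dtw_distance(unknown, templates[name]))


templates = {
    "rise": [[0.0], [1.0], [2.0], [3.0]],
    "fall": [[3.0], [2.0], [1.0], [0.0]],
}
# The same rising contour spoken more slowly: some frames are repeated.
slow_rise = [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]]
print(recognise(slow_rise, templates))   # rise
```
-
- Because repeated frames can be absorbed by the warping path, the slow
- utterance matches the "rise" template at zero cost even though the two
- sequences differ in length.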
-
- Some systems utilise knowledge of grammar to help with the recognition
- process.
-
- Some systems attempt to utilise prosody (pitch, stress, rhythm etc) to
- process the speech input.
-
- Some systems try to "understand" speech. That is, they try to convert the
- words into a representation of what the speaker intended to mean or achieve
- by what they said.
-
- ------------------------------------------------------------------------
-
- Q6.6: What are some good references/books on recognition?
-
- Some general introductory books on speech recognition:
-
- Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang Juang
- Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993
- ISBN 0-13-015157-2
-
- Speech recognition by machine; W.A. Ainsworth
- London: Peregrinus for the Institution of Electrical Engineers, c1988
-
- Speech synthesis and recognition; J.N. Holmes
- Wokingham: Van Nostrand Reinhold, c1988
-
- Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal Processing,
- 1987.
-
- Electronic speech recognition: techniques, technology and applications
- edited by Geoff Bristow, London: Collins, 1986
-
- Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu Lee.
- San Mateo: Morgan Kaufmann, c1990
-
- More specific books/articles:
-
- Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki, M.A. Jack.
- Edinburgh: Edinburgh University Press, c1990
-
- Automatic speech recognition: the development of the SPHINX system;
- by Kai-Fu Lee; Boston; London: Kluwer Academic, c1989
-
- Prosody and speech recognition; Alex Waibel
- (Pitman: London) (Morgan Kaufmann: San Mateo, Calif) 1988
-
- S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An Introduction to the
- Application of the Theory of Probabilistic Functions of a Markov Process
- to Automatic Speech Recognition" in Bell Syst. Tech. Jnl. v62(4),
- pp1035--1074, April 1983
-
- R. P. Lippmann, "Review of Neural Networks for Speech Recognition", in
- Neural Computation, v1(1), pp 1-38, 1989.
-
- ------------------------------------------------------------------------
-
- Q6.7: What speech recognition packages are available?
-
- Information is included below on the following packages:-
-
- Voice Blaster Ver. 4.0
- Votan
- HTK (HMM Toolkit)
- DragonDictate
- VoiceServer for Windows
- IN3 Voice Command for Windows
- IN3 Voice Command
- SayIt
- Recnet
- Voice Command Line Interface
- DATAVOX
-
-
- Package Name: Voice Blaster Ver. 4.0
- Platform: IBM AT or higher, DOS or Windows 3.1
- Description: Uses a Sound Blaster or compatible board. Contains a
- microphone headset and a connector for LPT1:. A printer can
- still be used on LPT1:. Will recognize 1024 words that are
- trained by the operator. Each word activates a macro that can
- enter an ascii word on the screen or into a word processor or
- invoke a batch file. An optional footswitch may be installed.
- Software to run under DOS or Windows 3.1 is included.
- Cost: Around $150 Canadian.
- Contact: COVOX Inc.
- 675 Conger Street
- Eugene, Oregon
- 97402
- Ph: (503) 342-1271 Fax: (503) 342-1283
- BBS: (503) 342-4135
-
-
- Package Name: Votan
- Platform: MS-DOS, SCO UNIX
- Description: Isolated word and continuous speech modes, speaker dependent
- and (limited) speaker independent. Vocab size is 255 words or up to a
- fixed memory limit - but it is possible to dynamically load different
- words for effectively unlimited number of words.
- Rough Cost: Approx US $1,000-$1,500
- Requirements: Cost includes one Votan Voice Recognition ISA-bus board
- for 386/486-based machines. A software development system is also
- available for DOS and Unix.
- Misc: Up to 8 Votan boards may co-exist for 8 simultaneous voice users.
- A telephone interface is also available. There is also a 4GL and a
- software development system.
- Apparently there is more than one version - more info required.
- Contact: 800-877-4756, 510-426-5600
-
-
- Package Name: HTK (HMM Toolkit) - From Entropic
- Platform: Range of Unix platforms.
- Description: HTK is a software toolkit for building continuous density HMM
- based speech recognisers. It consists of a number of library
- modules and a number of tools. Functions include speech analysis,
- training tools, recognition tools, results analysis, and an
- interactive tool for speech labelling. Many standard forms of
- continuous density HMM are possible. Can perform isolated word or
- connected word speech recognition. It can model whole words or
- sub-word units. Can perform speaker verification and other pattern
- recognition work using HMMs. HTK is now integrated with the
- ESPS/Waves speech research environment which is described in
- Section 1.8 of this posting.
- Misc: The availability of HTK changed in early 1993 when Entropic obtained
- exclusive marketing rights to HTK from the developers at Cambridge.
- Cost: On request.
- Contact: Entropic Research Laboratory, Washington Research Laboratory,
- 600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
- (202) 547-1420. email - info@wrl.epi.com
-
-
- Package Name: DragonDictate-30K
- Platform: PC
- Description: Speaker dependent/adaptive system requiring words to be
- separated by short pauses. Vocabulary of 25,000 words including
- a "custom" word set.
- Rough Cost: $5000
- Requirements: Minimum of 20 MHz 386 with 8M memory and 10M disk space
- Contact: Dragon Systems Inc.
- 90 Bridge Street, Newton MA 02158
- Tel: 1-617-965-5200, Fax: 1-617-527-0372
-
-
- Package Name: VoiceServer for Windows
- Platform: PC
- Description: Speaker dependent; each user has an independent directory.
- Isolated word. Up to 1000 words/user, 300 words/window.
- 1 word occupies 2Kb on hard disk.
- Can be used to control Windows applications by issuing
- voice commands instead of menu selection.
- Rough Cost: 292 Pounds (UK)
- Requirements: None
- Misc: Price includes a half-sized AT voice card (including a
- DSP), software, documentation & a microphone (attachable to
- keyboard or speaker). A light-weight high-spec headset is an
- optional extra.
- Contact: Mark Redwood
- Applied Voice Technologies
- 26 Danbury Street, Islington,
- London, UK, N1 8JU
- Ph: + 44 71 454 1224 : Fax: + 44 71 454 1225
-
-
- Package Name: IN3 Voice Command for Windows
- Platform: PC with Windows 3.1
- Description: IN3 is now available for MS-Windows. Users can call
- applications to the foreground with voice commands. Once the
- application is called, the user may enter commands and data with
- voice commands. Voice macros can reduce the strain of repetitive
- stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by
- replacing heavy repetitive keyboard hammering with simple voice
- operations. Voice macros take complex operations and reduce them
- to simple verbal commands. Voice input can provide new facilities
- for tasks which could not easily have been otherwise performed
- without the multiple axes of input. IN3 is hardware-independent:
- users with any Windows-compatible audio board can add speech
- recognition to the desktop. IN3 works with either 8 bit or 16 bit
- Windows audio boards. IN3 is based on continuous word-spotting
- technology. A developer API is also available for creating
- voice-enabled applications.
- Price: $179 U.S.
- Requirements: PC with 80386 processor or better, Microsoft Windows 3.1, and
- Windows compatible audio system with microphone.
- Misc: Fully functional demos are available on Compuserve in various
- Multimedia and CAD forums. Demos are also available from "America
- on Line", the comp.binaries.ms-windows archive sites, and various
- BBS systems. It is also available by anonymous ftp as
- ftp.wustl.edu:/usenet/comp.binaries.ms-windows/v3/in3demo.zip
- ftp.uwasa.fi:/mirror/ultrasound/demo/in3demo.zip
- An equivalent Sun product is described below.
- Contact: Brantley Kelly
- Email: cbk@gacc.atl.ga.us CIS: 75120,431
- FAX: 1-404-925-7924 Phone: 1-404-925-7950
- Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
-
-
-
- Package Name: IN3 Voice Command
- Platform: Sun SPARCstation
- Description: IN3 provides a secure, robust, word spotting, continuous
- speech recognition facility for the Sun OS or Solaris operating
- systems. The recognition system is a secure operating system
- facility capable of working with various interfaces, microphones,
- and devices. The operating system interface works with native UNIX
- outside of X Windows and also provides enhanced X Windows facilities
- including named window support. The user interface provides a
- means to quickly create commands on the fly for replacing long strings
- and complex operations with voice macros. [Voice macros can reduce
- the strain of repetitive stress injuries (RSI) such as Carpal Tunnel
- Syndrome (CTS) by replacing heavy repetitive keyboard hammering with
- simple voice operations.]
- The IN3 user interface works with generic X servers and window
- managers. A developer API is also available for creating voice-
- enabled applications, interfacing with other audio sources, and
- providing extensive application control over the recognition facility.
- Availability: SunSite archive at SunSITE.unc.edu as well as on Catalyst
- CDware as both a runnable demo and unlockable software.
- Hardware Required: Sun SPARCstation with audio input.
- Noise canceling microphone recommended but not required.
- Software Required: Sun OS 4.1.2 with OpenWindows 3.0 or
- Sun OS 4.1.3 or
- Solaris 2.1 or Solaris 2.2
- Misc: An equivalent MS-Windows product is described above.
- Price: $495 U.S.
- Contact: Brantley Kelly
- Email: cbk@gacc.atl.ga.us CIS: 75120,431
- FAX: 1-404-925-7924 Phone: 1-404-925-7950
- Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
-
-
- Package Name: Phonetic Engine 400 (PE400) - Speech Systems, Inc.
- Platform: PC
- Description: Speaker independent, large vocabulary, continuous speech
- recognition for MS Windows or DOS.
- Rough Cost: US $1195. Includes board, microphone, developer kit,
- documentation, 2 days of technical training and 90 days of
- technical support.
- Requirements: IBM AT class machine or better plus 5M disk space. Most
- processing is performed on-board (4M standard or 16M upgrade).
- Misc: Requires developer to provide a context-free grammar.
- Vocabulary size unknown (quotes from 500 - 2000 words per grammar),
- but dynamic grammar switching capabilities may increase the
- effective vocabulary size.
- Development system includes a lower-level C/C++ library (VoiceLib),
- a higher-level DLL (SPOT) callable from many languages, and SPOT/VBX,
- a custom control for Visual Basic and Visual C++.
- Contact: Speech Systems, Inc.
- 2945 Center Green Court South
- Boulder, CO 80301-2275, USA
- Tel: 303.938.1110 Fax: 303.938.1874
-
-
- Package Name: SayIt
- Platform: Sun SPARCstation
- Description: Voice recognition and macro building package for Suns
- in the OpenWindows 3.0 environment. Speaker dependent discrete speech
- recognition. Vocabularies can be associated with applications and the
- active vocabulary follows the application that has input focus.
- Macros can include mouse commands, keystrokes, Unix commands,
- sound, OpenWindows actions and more.
- An evaluation copy is available by email.
- Hardware: Microphone required (SunMicrophone is fine).
- Cost: $US295
- Contact: Phone: 1-800-245-UNIX or 1-415-572-0200
- Fax: 1-415-572-1300
- Email: info@qualix.com
-
-
- Package Name: recnet
- Platform: UNIX
- Description: Speech recognition for the speaker independent TIMIT and
- Resource Management tasks. It uses recurrent networks to estimate
- phone probabilities and Markov models to find the most probable
- sequence of phones or words. The system is a snapshot of evolving
- research code. There is no documentation other than published
- research papers. The components are:
- 1. A preprocessor which implements many standard and many non-
- standard front end processing techniques.
- 2. A recurrent net recogniser and parameter files
- 3. Two Markov model based recognisers, one for phone recognition
- and one for word recognition
- 4. A dynamic programming scoring package
- The complete system performs competitively.
- Cost: Free
- Requirements: TIMIT and Resource Management databases
- Contact: ajr@eng.cam.ac.uk (Tony Robinson)
- Availability: by FTP from "svr-ftp.eng.cam.ac.uk" as /misc/recnet-1.3.tar.Z
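-
- Illustration: the recnet description pairs per-frame phone probabilities
- with a Markov model search for the most probable phone sequence. The
- Python sketch below shows that kind of search (Viterbi decoding) in its
- simplest form. It is not taken from recnet: the function name, the
- argument layout and any numbers fed to it are assumptions made purely
- for illustration, and the per-frame scores are treated as given.

```python
import math


def viterbi(log_emit, log_trans, log_init):
    """Return the most probable state (phone) index sequence.

    log_emit[t][s]  : log score of phone s at frame t (hypothetical input,
                      e.g. derived from a network's per-frame outputs)
    log_trans[r][s] : log probability of moving from phone r to phone s
    log_init[s]     : log probability of starting in phone s
    """
    n_states = len(log_init)
    # Best log score of any path ending in each state at the first frame.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t

    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

- Working in log probabilities avoids numerical underflow on long
- utterances. A word recogniser would apply the same recursion over states
- arranged according to a pronunciation lexicon and grammar.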
-
-
- Package Name: Voice Command Line Interface
- Platform: Amiga
- Description: VCLI will execute CLI commands, ARexx commands, or ARexx
- scripts by voice command through your audio digitizer. VCLI allows
- you to launch multiple applications or control any program with an
- ARexx capability entirely by spoken voice command. VCLI is fully
- multitasking and will run in the background, continuously listening
- for your voice commands even while other programs are running.
- Documentation is provided in AmigaGuide format.
- VCLI 6.0 runs under either Amiga DOS 2.0 or 3.0.
- Cost: Free?
- Requirements: Supports the DSS8, PerfectSound 3, Sound Master, Sound Magic,
- and Generic audio digitizers.
- Availability: by ftp from wuarchive.wustl.edu in the file
- systems/amiga/incoming/audio/VCLI60.lha and from
- amiga.physik.unizh.ch as the file pub/aminet/util/misc/VCLI60.lha
- Contact: Author's email is RHorne@cup.portal.com
-
-
- Package Name: DATAVOX - French
- Platform: PC
- Description: Continuous speech - speaker independent or dependent.
- Rough Cost: ?
- Requirements: 2 PC format boards (RdF1000 and TdS 96/25) and an
- A/D - D/A module (ASA116)
- Misc: Application software may communicate with DATAVOX through two
- types of interface:
- 1) Keyboard overlay
- The application software may be used with any PC compatible
- package. No specific adaptation is necessary, you only need
- to define your configuration with the application software.
- 2) C library
- Allows a user-written program to drive the recognition system.
- DATAVOX is based on the AMADEUS speech recognition software
- developed at LIMSI. It provides
- - Continuous speech recognition with
- * speaker dependent: 500 words
- * speaker independent: 50 words (custom-made vocabulary).
- - Grammar of the application language (syntax acquisition,
- verification and simplification software).
- - Large vocabulary : DATAVOX can recognize vocabularies of several
- thousand words as long as there are no more than 500 words in the
- active vocabulary at any given node. It takes less than 1 second
- to change syntax and vocabulary.
- - Training controlled by the system (use of co-articulation models).
- - Response time less than 500 ms for any phrase length.
- - Synthesis (ADPCM) can be heard simultaneously while recognition
- is being carried out.
- Contact: VECSYS, Le Chene rond, 91570 Bievres, France
- Fax: 33 1 69 41 24 30
- Voice: 33 1 69 41 15 04
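-
- Illustration: the large-vocabulary note above says that the total
- vocabulary may run to several thousand words provided that no grammar
- node has more than 500 words active. The Python sketch below checks a
- grammar against such a per-node limit. The data layout (a mapping from
- node name to its word list) and the function name are hypothetical and
- are not part of the DATAVOX C library.

```python
MAX_ACTIVE_WORDS = 500  # per-node limit quoted in the DATAVOX entry


def check_grammar(nodes, limit=MAX_ACTIVE_WORDS):
    """Summarise a grammar given as {node_name: iterable_of_words}.

    Returns (total_distinct_words, sorted_names_of_nodes_over_limit).
    """
    all_words = set()
    over_limit = []
    for name, words in nodes.items():
        active = set(words)          # duplicates at one node count once
        all_words |= active
        if len(active) > limit:
            over_limit.append(name)
    return len(all_words), sorted(over_limit)
```

- A grammar passes when the second element of the result is empty, even
- if the first element (the total vocabulary) is far larger than the
- per-node limit.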
-
-
- Package: PowerSecretary
- Platform: Mac
- Price: $US5,000 (including a Centris or Quadra AV)
- Availability: Articulate Systems Inc.
- 600 W. Cummings Park, Suite 4500
- Woburn, MA 01801
- Ph: (617) 935-5656 Fax: (617) 935-0490.
-
-
- Package: ICSS system from IBM
- Description: A large vocabulary, speaker independent, continuous speech
- system which runs under Windows, OS/2, and AIX.
- Requirements: Soundboard (e.g. Soundblaster)
- Price: ?
- Contact: ?
-
-
- Package: Creative VoiceAssist
- Platform: PC (?)
- Price: $US99.95
- Contact: Creative Labs
- Ph: 1-800-998-5227
-
-
- Package Name: xxx
- Platform: PC, Mac, UNIX, Amiga ....
- Description: (e.g. isolated word, speaker independent...)
- Rough Cost: (if applicable)
- Requirements: (hardware/software needs - if applicable)
- Misc:
- Contact: (email, ftp or address)
-
-
- Can anyone provide info on
-
- Verbex Listen for Windows
- Voice Navigator (from Articulate Systems)
- SRI Recognisers
- BBN Recognisers
-
-
- Can you provide information on any other software/hardware/packages?
- Commercial, public domain and research packages are all appropriate.
-
-
-
-
- Andrew Hunt
- Speech Technology Research Group Ph: 61-2-692 4509
- Dept. of Electrical Engineering Fax: 61-2-692 3847
- University of Sydney, NSW, 2006, Australia email: andrewh@speech.su.oz.au
-